Tacotron 2
https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.ap-northeast-1.amazonaws.com%2F0%2F3121510%2F0bae893b-2454-1bfe-3155-9f918c64d7c0.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=b50ace47afb9c26eda68ee4d5da621a3
Tacotron 2 is a neural network architecture for speech synthesis directly from text.
The system is composed of a recurrent sequence-to-sequence feature prediction network
that maps character embeddings to mel-scale spectrograms,
followed by a modified WaveNet model
acting as a vocoder to synthesize time-domain waveforms from those spectrograms.
Our model achieves a mean opinion score (MOS) of 4.53, comparable to a MOS of 4.58 for professionally recorded speech.
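To make "mel-scale" in the description above more concrete, here is a minimal sketch of the Hz-to-mel conversion in Python. It uses the common 2595 * log10(1 + f / 700) formula, which is one of several mel variants and is not necessarily the one used in the paper. The function names and the values 0 Hz, 8000 Hz, and 80 bands are illustrative assumptions, not taken from the source.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (common HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_min: float, f_max: float, n_bands: int) -> list[float]:
    """Return n_bands + 2 edge frequencies (Hz), equally spaced in mels.

    Hypothetical helper: these edges are what a triangular mel filterbank
    would be built from. Low frequencies get narrow bands, high ones wide.
    """
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]


if __name__ == "__main__":
    # Illustrative values only (assumed, not from the note above).
    edges = mel_band_edges(0.0, 8000.0, 80)
    print(f"first band width: {edges[1] - edges[0]:.1f} Hz")
    print(f"last band width:  {edges[-1] - edges[-2]:.1f} Hz")
```

Running it shows that the bands are much narrower at low frequencies than at high ones, which is the point of using a mel scale for the intermediate spectrogram representation.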
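The notes below are probably wrong; I will correct them once I have read the paper properly.
The base architecture for TTS models from a little while ago?
Originally, the conditioning input was linguistic, duration, and F0 features; Tacotron 2 uses mel spectrograms instead. (I don't fully understand this part yet.)
MOS (Mean Opinion Score): 4.53, compared with a MOS of 4.58 for professionally recorded speech.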
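As a reminder of what a MOS is, here is a minimal Python sketch: it is simply the arithmetic mean of listener ratings. The 1 to 5 rating scale is assumed, and the function name and the sample ratings are hypothetical, not data from the paper.

```python
from statistics import mean


def mean_opinion_score(ratings: list[float]) -> float:
    """Average listener ratings, assuming a 1 (bad) to 5 (excellent) scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return mean(ratings)


if __name__ == "__main__":
    # Hypothetical ratings from five listeners for one synthesized utterance.
    sample_ratings = [5, 4, 5, 4.5, 4]
    print(mean_opinion_score(sample_ratings))  # prints 4.5
```

In practice a MOS is reported over many utterances and raters, usually with a confidence interval, so a gap as small as 4.53 versus 4.58 should be read with that interval in mind.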
FYI